Assessing Identification Risk in Survey Microdata

Authors

  • CHRIS SKINNER
  • NATALIE SHLOMO
Abstract

This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical 'key' variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of 'reasonable' models, risk estimates tend to decrease as the complexity of the model increases. We ...

* Chris Skinner is Professor, Southampton Statistical Sciences Research Institute (S3RI), University of Southampton, United Kingdom. Natalie Shlomo is a research student, S3RI, University of Southampton, and Department of Statistics, Hebrew University, Jerusalem, Israel.
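To make the idea of model-based risk estimation concrete, here is a minimal sketch. It is not the authors' procedure, and the abstract gives no formulas. The sketch assumes a Poisson model for population cell counts, a known sampling fraction, and the simplest log-linear model (main effects only, that is, independence of the key variables). The function names, the toy records, and the two risk summaries (`tau1`, the expected number of sample uniques that are also population unique, and `tau2`, the expected number of correct matches for sample uniques) are illustrative assumptions.

```python
import math
from collections import Counter


def fit_independence(records):
    """Return a function giving fitted sample cell means under a
    main-effects-only (independence) log-linear model.
    Assumption: this closed form stands in for a richer fitted model."""
    n = len(records)
    n_vars = len(records[0])
    # One-way margins of each key variable in the sample.
    margins = [Counter(r[j] for r in records) for j in range(n_vars)]

    def mu(cell):
        m = float(n)
        for j, value in enumerate(cell):
            m *= margins[j][value] / n
        return m

    return mu


def estimate_risk(records, sampling_fraction):
    """Return (number of sample uniques, tau1, tau2).
    Assumption: population counts F_k are Poisson with mean mu_k / pi,
    so F_k - 1 given a sample-unique cell is Poisson with mean
    mu_k * (1 - pi) / pi."""
    counts = Counter(records)
    mu = fit_independence(records)
    uniques, tau1, tau2 = 0, 0.0, 0.0
    for cell, f in counts.items():
        if f != 1:
            continue  # only sample-unique cells contribute
        uniques += 1
        rest = mu(cell) * (1 - sampling_fraction) / sampling_fraction
        if rest == 0:
            # Full enumeration: a sample unique is a population unique.
            tau1 += 1.0
            tau2 += 1.0
        else:
            tau1 += math.exp(-rest)                 # P(F_k = 1 | f_k = 1)
            tau2 += (1 - math.exp(-rest)) / rest    # E(1 / F_k | f_k = 1)
    return uniques, tau1, tau2


# Toy sample: each record is a tuple of categorical key variables
# (sex, age band, region). The values are made up for illustration.
sample = [
    ("M", "20-29", "North"),
    ("M", "20-29", "North"),
    ("F", "30-39", "South"),
    ("F", "20-29", "North"),
    ("M", "40-49", "South"),
    ("F", "30-39", "South"),
]
uniques, tau1, tau2 = estimate_risk(sample, sampling_fraction=0.1)
print(uniques, tau1, tau2)
```

Swapping `fit_independence` for a model with interaction terms changes the fitted cell means and therefore the risk estimates, which is the sensitivity to model specification that the abstract is concerned with.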

Similar resources

ACRONYM: Data without Boundaries D

Disclosure limitation methods for protecting the confidentiality of respondents in survey microdata often use perturbative techniques which introduce measurement error into the categorical identifying variables. In addition, the data itself will often have measurement errors commonly arising from survey processes. There is a need for valid and practical ways to assess the protect...

Assessing the Statistical Disclosure Risk of a Demographic Microdata File

There are two recent developments related to survey data dissemination that may be increasing the risk of disclosure of respondent data. One is that statistical agencies are now releasing more microdata files than previously, partly in response to the urging of researchers needing the data for precise analytic work. For example, some data-rich files with possibly high disclosure risk, that have...

Assessing Microdata Disclosure Risk Using the Poisson-inverse Gaussian Distribution

An important measure of identification risk associated with the release of microdata or large complex tables is the number or proportion of population units that can be uniquely identified by some set of characterizing attributes which partition the population into subpopulations or cells. Various methods for estimating this quantity based on sample data have been proposed in the literature by ...

Assessing Disclosure Risk for Record Linkage

An intruder seeks to match a microdata file to an external file using a record linkage technique. The identification risk is defined as the probability that a match is correct. The nature of this probability and its estimation are explored. Some connections are made to the literature on disclosure risk based on the notion of population uniqueness.

Using Mahalanobis Distance-Based Record Linkage for Disclosure Risk Assessment

Distance-based record linkage (DBRL) is a common approach to empirically assessing the disclosure risk in SDC-protected microdata. Usually, the Euclidean distance is used. In this paper, we explore the potential advantages of using the Mahalanobis distance for DBRL. We illustrate our point for partially synthetic microdata and show that, in some cases, Mahalanobis DBRL can yield a very high re-...

Publication date: 2006